1,358 research outputs found

    Large-scale event extraction from literature with multi-level gene normalization

    Get PDF
    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons -Attribution - Share Alike (CC BY-SA) license

    Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations

    Get PDF
    Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators

    Combining supervised and unsupervised named entity recognition to detect psychosocial risk factors in occupational health checks

    Get PDF
    Introduction: In occupational health checks the information about psychosocial risk factors, which influence work ability, is documented in free text. Early detection of psychosocial risk factors helps occupational health care to choose the right and targeted interventions to maintain work capacity. In this study the aim was to evaluate if we can automate the recognition of these psychosocial risk factors in occupational health check electronic records with natural language processing (NLP). Materials and methods: We compared supervised and unsupervised named entity recognition (NER) to detect psychosocial risk factors from health checks’ documentation. Occupational health nurses have done these records. Results: Both methods found over 60% of psychosocial risk factors from the records. However, the combination of BERT-NER (supervised NER) and QExp (query expansion/paraphrasing) seems to be more suitable. In both methods the most (correct) risk factors were found in the work environment and equipment category. Conclusion: This study showed that it was possible to detect risk factors automatically from free-text documentation of health checks. It is possible to develop a text mining tool to automate the detection of psychosocial risk factors at an early stage</p

    Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling

    Get PDF
    We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under open license.</p

    Application of the EVEX resource to event extraction and network construction : shared task entry and result analysis

    Get PDF
    BACKGROUND : Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents can be utilized to improve the performance of the state-of-the-art event extraction systems. In this paper, we describe our participation in the latest BioNLP Shared Task using the large-scale text mining resource EVEX. We participated in the Genia Event Extraction (GE) and Gene Regulation Network (GRN) tasks with two separate systems. In the GE task, we implemented a re-ranking approach to improve the precision of an existing event extraction system, incorporating features from the EVEX resource. In the GRN task, our system relied solely on the EVEX resource and utilized a rule-based conversion algorithm between the EVEX and GRN formats. RESULTS : In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. In the GRN task, we ranked fifth in the official results with a strict/relaxed SER score of 0.92/0.81 respectively. To try and improve upon these results, we have implemented a novel machine learning based conversion system and benchmarked its performance against the original rule-based system. CONCLUSIONS : For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates. In the GE task we demonstrate that both the re-ranking approach and the word vectors can provide slight performance improvement. A manual evaluation of the re-ranking results pinpoints some of the challenges faced in applying large-scale text mining knowledge to event extraction.Computational resources were provided by CSC IT Center for Science Ltd., Espoo, Finland. The work of KH and FG was supported by the Academy of Finland, and of SVL by the Research Foundation Flanders (FWO). YVdP and SVL acknowledge the support from Ghent University (Multidisciplinary Research Partnership Bioinformatics: from nucleotides to networks).http://www.biomedcentral.com/bmcbioinformaticsam201

    Neural Network and Random Forest Models in Protein Function Prediction

    Get PDF
    Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence. We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance ranking among top-10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as an open source software at https://github.com/TurkuNLP/CAFA3.</p

    Proceedings of ACL-IJCNLP 2015 System Demonstrations

    Get PDF

    Assisting nurses in care documentation: from automated sentence classification to coherent document structures with subject headings

    Get PDF
    Background: Up to 35% of nurses' working time is spent on care documentation. We describe the evaluation of a system aimed at assisting nurses in documenting patient care and potentially reducing the documentation workload. Our goal is to enable nurses to write or dictate nursing notes in a narrative manner without having to manually structure their text under subject headings. In the current care classification standard used in the targeted hospital, there are more than 500 subject headings to choose from, making it challenging and time consuming for nurses to use. Methods: The task of the presented system is to automatically group sentences into paragraphs and assign subject headings. For classification the system relies on a neural network-based text classification model. The nursing notes are initially classified on sentence level. Subsequently coherent paragraphs are constructed from related sentences. Results: Based on a manual evaluation conducted by a group of three domain experts, we find that in about 69% of the paragraphs formed by the system the topics of the sentences are coherent and the assigned paragraph headings correctly describe the topics. We also show that the use of a paragraph merging step reduces the number of paragraphs produced by 23% without affecting the performance of the system. Conclusions: The study shows that the presented system produces a coherent and logical structure for freely written nursing narratives and has the potential to reduce the time and effort nurses are currently spending on documenting care in hospitals. </div
    • …
    corecore